Species Distribution Models (SDMs) correlate species presence or abundance with environmental factors to define species niches and predict their distribution. In this hands-on tutorial, you’ll fit your first SDM using a real-world dataset from FISHGLOB—a comprehensive compilation of fishery-independent trawl surveys spanning the continental shelves of Europe and North America.
1 Data overview
To fit SDMs you generally need two key ingredients: observational data (which species are found where) and environmental layers.
NOTE:Throughout the tutorial the R code is hidden, but if you want to see code, click the “Show the Code” button.
Show the code
# load the librarieslibrary(tidyverse)library(glmmTMB)library(rnaturalearth) #install.packages('rnaturalearth')library(rnaturalearthdata)library(sf)library(ggeffects)library(equatiomatic)library(patchwork)# load the datadata =read_csv('../data/model_input/data.csv')
total_hauls <-n_distinct(data$haul_id) # Calculate the total number of hauls(data %>%filter(present ==1) %>%# Only include present speciescount(species) %>%# Count occurrences per speciesmutate(percentage = n / total_hauls *100) %>%# Calculate percentagearrange(desc(percentage))) # Sort by percentage
data |>distinct(haul_id, year)|>group_by(year) |>summarise(number_of_trawls=n()) |>ggplot(aes(year,number_of_trawls))+geom_line()+scale_x_continuous(breaks= scales::pretty_breaks())
2 Example - modelling Atlantic cod (Gadus morhua) distribution
Warning
In this exercise, we’re skipping model assumption checks—a bad practice! But with limited time, I’d rather spark your curiosity about these models and focus only on interpretation.
Do you think this approach is reasonable, or are we missing something? Can you think of any alternatives to a linear model for describing a species’ relationship with temperature?
2.2.1 Predict, predict, predict!
Load and inspect the prediction grid:
Show the code
grid =read_csv('../data/model_input/grid.csv') # load the gridstr(grid)
Just eyeballing it, based on this model, will the (log)density change a lot under the future scenario, or is it comparable?
Can you think of one simple calculation for comparing the current vs future density?
In this example, we only considered bottom temperature as a driver of species distribution. Can you think of a couple more environmental variables that can influence fish densities?
2.3 A slightly more complex model
What if we want to predict Atlantic cod presence/absence? Well, in this case we need to leave the comfort of linear models and embrace Generalized Linear Models (GLMs).
2.3.1 Link functions
A GLM has two key parts:
the link function, which determines how the mean of the data relates to the predictors
the error distribution, which describes the variability around the mean
The link function is a transformation that makes the (mean of the) response data linear in relation to the predictors. Meanwhile, the error distribution captures how the data spread around this untransformed mean. Here we will focus on the logit link and the Bernoulli distribution as our response data are 0s and 1s. This is commonly referred to as logistic regression.
Show the code
x<-seq(-10, 10, length.out =100)plot(x, plogis(x), type ="l")
partial_effect <-predict_response(my_lr_SDM, terms ="bottom_temp") # this function is smart, it's already on the response scaleplot(partial_effect, ci_style ="dash")+ggtitle('')+ylab('Predicted probability of occurence')
Wait it’s not linear? What is the interpretation of the slope then?
Show the code
summary(my_lr_SDM)
Call:
glm(formula = present ~ 1 + bottom_temp, family = binomial(link = "logit"),
data = filter(data, species == "Gadus_morhua"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.54108 0.18151 30.53 <2e-16 ***
bottom_temp -0.52556 0.01964 -26.75 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9683.1 on 7788 degrees of freedom
Residual deviance: 8886.5 on 7787 degrees of freedom
AIC: 8890.5
Number of Fisher Scoring iterations: 4
The slope is -0.526.
Well, the relationship is not linear on the probability scale (i.e. response). The logit-transformed true probabilities follow the estimated intercept and slope.
A unit increase in bottom_temp corresponds to a -0.526 increase in the log odds of a 1 (species present) being observed.
Intuition box
I took this from https://github.com/seananderson/glmm-course/blob/master/12-glms.Rmd
If we exponentiate the slope coefficient we get the expected fold increase (decrease) in the odds of observing a 1 (the species being present): 0.59 per unit increase in bottom_temp.
A quick trick is to take the slope of the logistic regression and divide it by 4. This will give you approximately the expected change in probability per unit change in the variable at the steepest part of the line: -0.13.
3 Now it’s you turn!
Open the script 02_fit_my_model_template.R under ‘exercise’ and fit your model(s).
“Statistical Rethinking” is one of the nicest book out there for stats. Also lectures are available online. It’s long (20 lectures 1h 15min each) but absolutely worth every bit of effort.
Ovaskainen O., & Abrego N. (2020). Joint Species Distribution Modelling: With Applications in R. Cambridge: Cambridge University Press.
Thorson, J., & Kristensen, K. (2024). Spatio-Temporal Models for Ecologists (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781003410294
5.2 Stats
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge: Cambridge University Press.
McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and STAN (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9780429029608